Micro-kernels for portable and efficient matrix multiplication in deep learning
Authors
Abstract
We provide a practical demonstration that it is possible to systematically generate a variety of high-performance micro-kernels for the general matrix multiplication (gemm) via generic templates which can be easily customized to different processor architectures and micro-kernel dimensions. These templates employ vector intrinsics to exploit the SIMD (single instruction, multiple data) units in current general-purpose processors and, for the particular type of problems encountered in deep learning, deliver a floating-point throughput rate on par with, or even higher than, that obtained with conventional, carefully tuned implementations in linear algebra libraries (e.g., BLIS, AMD AOCL, ARMPL). Our work exposes the structure of the template-based micro-kernels for ARM Neon (128-bit SIMD), ARM SVE (variable-length SIMD) and Intel AVX512 (512-bit SIMD), showing considerable performance on an NVIDIA Carmel processor (ARM Neon), a Fujitsu A64FX processor (ARM SVE) and an AMD EPYC 7282 processor (256-bit SIMD).
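The paper's generated templates are not reproduced in this abstract, but the core idea, a register-resident micro-tile of C updated with one SIMD rank-1 update per iteration of the k loop, can be sketched with ARM Neon intrinsics. The following is a minimal hand-written sketch; the function name, the 4x4 micro-tile size and the packed-panel layout are illustrative assumptions, not the paper's actual interface.

#include <arm_neon.h>

/* Sketch of a 4x4 gemm micro-kernel with ARM Neon intrinsics.
   Computes C[0:4, 0:4] += A_panel * B_panel, where A_panel holds kc
   packed columns of 4 elements and B_panel holds kc packed rows of
   4 elements. ldc is the leading dimension of the column-major C. */
void ukernel_4x4_neon(int kc, const float *A, const float *B,
                      float *C, int ldc)
{
    /* One Neon register per column of the 4x4 micro-tile of C. */
    float32x4_t c0 = vld1q_f32(&C[0 * ldc]);
    float32x4_t c1 = vld1q_f32(&C[1 * ldc]);
    float32x4_t c2 = vld1q_f32(&C[2 * ldc]);
    float32x4_t c3 = vld1q_f32(&C[3 * ldc]);

    for (int k = 0; k < kc; k++) {
        float32x4_t a = vld1q_f32(&A[4 * k]);   /* 4 rows of A, column k */
        float32x4_t b = vld1q_f32(&B[4 * k]);   /* 4 cols of B, row k    */
        /* Rank-1 update c_j += a * b[j] via fused multiply-add. */
        c0 = vfmaq_laneq_f32(c0, a, b, 0);
        c1 = vfmaq_laneq_f32(c1, a, b, 1);
        c2 = vfmaq_laneq_f32(c2, a, b, 2);
        c3 = vfmaq_laneq_f32(c3, a, b, 3);
    }

    vst1q_f32(&C[0 * ldc], c0);
    vst1q_f32(&C[1 * ldc], c1);
    vst1q_f32(&C[2 * ldc], c2);
    vst1q_f32(&C[3 * ldc], c3);
}

Production micro-kernels in libraries such as BLIS typically use larger micro-tiles to hide FMA latency and fill more vector registers; the 4x4 shape above is chosen only to keep the sketch short.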
Similar resources
On Composing Matrix Multiplication from Kernels
Matrix multiplication is often treated as a basic unit of computation in terms of which other operations are implemented, yielding high performance. In this paper, initial evidence is provided that there is a benefit to be gained when the lower-level kernels from which matrix multiplication is composed are exposed. In particular, it is shown that matrix multiplication itself can be coded at a high level ...
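As an illustration of composing a full gemm from an exposed micro-kernel, the loop structure might look as follows; the gemm_composed name, the fixed MR = NR = 4 dimensions, the k <= 512 bound and the naive per-tile packing are assumptions of this sketch, not the paper's code.

enum { MR = 4, NR = 4 };

/* Assumed micro-kernel interface (matching the earlier Neon sketch):
   C[0:MR, 0:NR] += Apanel * Bpanel, with Apanel packed MR x kc and
   Bpanel packed kc x NR. */
void ukernel_4x4_neon(int kc, const float *A, const float *B,
                      float *C, int ldc);

/* Compose a column-major gemm from the micro-kernel, assuming m and n
   are multiples of MR/NR and k <= 512. Packing is done naively per
   micro-tile here; real libraries pack whole blocks once and add
   cache-blocking loops around this structure. */
void gemm_composed(int m, int n, int k,
                   const float *A, int lda,
                   const float *B, int ldb,
                   float *C, int ldc)
{
    float Ap[MR * 512], Bp[512 * NR];
    for (int j = 0; j < n; j += NR) {
        for (int p = 0; p < k; p++)          /* pack k x NR panel of B */
            for (int jj = 0; jj < NR; jj++)
                Bp[NR * p + jj] = B[(j + jj) * ldb + p];
        for (int i = 0; i < m; i += MR) {
            for (int p = 0; p < k; p++)      /* pack MR x k panel of A */
                for (int ii = 0; ii < MR; ii++)
                    Ap[MR * p + ii] = A[p * lda + i + ii];
            ukernel_4x4_neon(k, Ap, Bp, &C[j * ldc + i], ldc);
        }
    }
}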
Writing a performance-portable matrix multiplication
There are several frameworks that, while providing functional portability of code across different platforms, do not automatically provide performance portability. As a consequence, programmers have to hand-tune the kernel codes for each device. The Heterogeneous Programming Library (HPL) is one of these libraries, but it has the interesting feature that the kernel codes, which implement the co...
Efficient Matrix Multiplication in Hadoop
In a typical MapReduce job, each map task processes one piece of the input file. If two input matrices are stored in separate HDFS files, one map task would not be able to access both input matrices at the same time. To deal with this problem, we propose an efficient matrix multiplication in Hadoop. For dense matrices, we use plain row-major order to store the matrices on HDFS; for sparse ma...
Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors
We introduce a variational Bayesian neural network where the parameters are governed via a probability distribution on random matrices. Specifically, we employ a matrix variate Gaussian (Gupta & Nagar, 1999) parameter posterior distribution where we explicitly model the covariance among the input and output dimensions of each layer. Furthermore, with approximate covariance matrices we can achie...
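For reference, the matrix variate Gaussian density of Gupta & Nagar (1999), which is what lets a single posterior couple the input and output dimensions of a layer's weight matrix $\mathbf{W} \in \mathbb{R}^{n \times p}$, is

\[
p(\mathbf{W}) \;=\; \mathcal{MN}(\mathbf{M}, \mathbf{U}, \mathbf{V}) \;=\; \frac{\exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}\!\left[\mathbf{V}^{-1}(\mathbf{W}-\mathbf{M})^{\top}\mathbf{U}^{-1}(\mathbf{W}-\mathbf{M})\right]\right)}{(2\pi)^{np/2}\,|\mathbf{V}|^{n/2}\,|\mathbf{U}|^{p/2}},
\]

equivalently $\mathrm{vec}(\mathbf{W}) \sim \mathcal{N}(\mathrm{vec}(\mathbf{M}), \mathbf{V} \otimes \mathbf{U})$, so the $n \times n$ matrix $\mathbf{U}$ models covariance among the input dimensions (rows) and the $p \times p$ matrix $\mathbf{V}$ among the output dimensions (columns).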
Implementing Efficient, Portable Computations for Machine Learning
Journal
Journal title: The Journal of Supercomputing
Year: 2022
ISSN: 0920-8542, 1573-0484
DOI: https://doi.org/10.1007/s11227-022-05003-3